feat: add JoyImage edit plus#14032
Conversation
|
Hi @tangyanf, thanks for the PR! It does not appear to link an issue it fixes. If this PR addresses an existing issue, please add a closing keyword (e.g. |
There was a problem hiding this comment.
🤗 Serge says:
This PR adds the JoyImage Edit Plus model and pipeline. There are several blocking issues that need to be addressed before merging.
Blocking — Debug artifacts left in production code
Multiple torch.save() calls, a print() statement, and a commented-out exit(0) are left in pipeline_joyimage_edit_plus.py. These will write files to the user's working directory and print to stdout during every inference call.
Blocking — einops dependency
Per .ai/models.md: "No new mandatory dependency without discussion (e.g. einops). Optional deps guarded with is_X_available() and a dummy in utils/dummy_*.py." The pipeline directly imports from einops import rearrange — this is the only non-comment usage of einops in src/diffusers/. The rearrange calls should be rewritten with native PyTorch (reshape, permute, unflatten).
Blocking — sglang integration code in model forward
The transformer's forward method contains sglang-specific code: list-unwrapping for "SglangXvideo CFG branches" (lines 272-276) and a try: from sglang... fallback (lines 279-287). Per .ai/AGENTS.md: "No defensive code, unused code paths, or legacy stubs — do not add fallback paths, safety checks, or configuration options 'just in case'." This code doesn't belong in the diffusers model — the pipeline always passes the required arguments.
Blocking — Missing dummy objects
JoyImageEditPlusTransformer3DModel, JoyImageEditPlusPipeline, and JoyImageEditPlusPipelineOutput are not registered in dummy_pt_objects.py / dummy_torch_and_transformers_objects.py. This will cause ImportError when torch/transformers are not installed.
Blocking — Missing tests
No test files were added for the new model or pipeline.
Blocking — Hardcoded device_type="cuda" in torch.autocast
torch.autocast(device_type="cuda", ...) is hardcoded in two places in the pipeline. This will fail on MPS, XPU, and other non-CUDA devices.
Non-blocking — Inlined scheduler sigma math
Per .ai/pipelines.md gotcha #3, the pipeline manually computes shifted sigmas and temporarily overrides self.scheduler.shift — this is exactly what FlowMatchEulerDiscreteScheduler does with its shift config. The scheduler should own this logic.
Non-blocking — Unused imports and parameters
import inspectintransformer_joyimage_edit_plus.pyis unused.enable_denormalizationparameter is declared inprepare_latentsand__call__but never read.retrieve_timestepsis duplicated from the existing pipeline without a# Copied fromannotation.
serge v0.1.0 · model: claude-opus-4-6 · 29 LLM turns · 50 tool calls · 190.2s · 1602502 in / 7369 out tokens
- Remove einops dependency: replace rearrange with reshape/permute
- Remove sglang-specific code from transformer forward
- Remove unused import inspect from transformer
- Fix hardcoded device_type="cuda" to use device.type
- Simplify scheduler sigma math: delegate to retrieve_timesteps
- Remove unused enable_denormalization parameter
- Fix callback latents variable binding
- Fix output_type="pt" to return stacked tensor
- Set return_dict default to True in transformer forward
- Add dummy objects for JoyImageEditPlus classes
- Add transformer and pipeline test files
6f2763a to
8a911e5
Compare
|
@claude can you do a review here? |
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
yiyixuxu
left a comment
There was a problem hiding this comment.
thanks, i left some feedbacks
@yiyixuxu Thank you for taking the time to review this PR! I've addressed all the feedback — here's a summary of the changes:
Please let me know if there's anything else that needs to be updated! |
re check |
|
@tarekziade Hi! I've addressed all the feedback from sergereview[bot] in my latest commits. However, the bot's "changes requested" review is still blocking the merge. Could you help me with one of the following:
Thanks! |
…t_to_diffusers.py
JoyImage Edit and Edit Plus share identical VAE and transformer weight
layouts — only the target model class differs. Consolidate both into a
single script with a --model_type flag (edit | edit_plus) instead of
maintaining two nearly-duplicate files.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@bot /style |
|
Style bot fixed some files and pushed the changes. |
- Add missing Returns section to JoyImageEditPlusTransformer3DModel.forward docstring - Fix alphabetical ordering of dummy classes in dummy_pt_objects.py
Replace `torch.autocast(dtype=torch.float32)` + `.float()` with `.to(self.vae.dtype)` for both VAE encode and decode calls. The previous approach caused dtype mismatch (float32 input vs bfloat16 bias) on CPU where autocast does not automatically cast conv weights, breaking the CI `test_layerwise_casting_inference` test.
dg845
left a comment
There was a problem hiding this comment.
Thanks for the PR! Left some design comments :).
| if image_rotary_emb is not None: | ||
| vis_freqs, txt_freqs = image_rotary_emb | ||
| if vis_freqs is not None: | ||
| img_query, img_key = _apply_rotary_emb_batched(img_query, img_key, vis_freqs) | ||
| if txt_freqs is not None: | ||
| txt_query, txt_key = _apply_rotary_emb_batched(txt_query, txt_key, txt_freqs) |
There was a problem hiding this comment.
Since we always set the text RoPE embeddings to None in JoyImageEditPlusTransformer3DModel.forward, I think we can simplify the code here to remove the text RoPE path and image_rotary_emb to only contain the image RoPE cos and sin components (which would also match the current variables names better).
| @register_to_config | ||
| def __init__( | ||
| self, | ||
| patch_size: list = [1, 2, 2], |
There was a problem hiding this comment.
| patch_size: list = [1, 2, 2], | |
| patch_size: list[int] = [1, 2, 2], |
nit: improve type hint
| self.patch_size = patch_size | ||
| self.hidden_size = hidden_size | ||
| self.num_attention_heads = num_attention_heads | ||
| self.rope_dim_list = rope_dim_list | ||
| self.rope_type = rope_type | ||
| self.theta = theta |
There was a problem hiding this comment.
I don't think we need to define these attributes explicitly since the __init__ arguments are already registered to the config. For example, we could use self.config.hidden_size in _get_rotary_pos_embed_for_range without needing to define self.hidden_size here.
| self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial) | ||
| self.vae_image_processor = JoyImageEditImageProcessor( | ||
| vae_scale_factor=self.vae_scale_factor_spatial, | ||
| ) |
There was a problem hiding this comment.
| self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial) | |
| self.vae_image_processor = JoyImageEditImageProcessor( | |
| vae_scale_factor=self.vae_scale_factor_spatial, | |
| ) | |
| self.image_processor = JoyImageEditImageProcessor(vae_scale_factor=self.vae_scale_factor_spatial) |
Only keep the JoyImageEditImageProcessor (since the VaeImageProcessor is unused) and rename it to self.image_processor, which is the standard diffusers name.
| padding = torch.zeros((x.shape[0], padding_length), dtype=x.dtype, device=x.device) | ||
| return torch.cat([x, padding], dim=1) | ||
|
|
||
| def normalize_latents(self, latent: torch.Tensor) -> torch.Tensor: |
There was a problem hiding this comment.
I think we should inline normalize_latents and denormalize_latents, as they are only called once and this would follow the diffusers design better. We should also be able to simplify the code here as AutoencoderKLWan should always have latents_mean and latents_std in its config.
|
|
||
| # Prepare latents (patchified) | ||
| num_channels_latents = self.transformer.config.in_channels | ||
| padded_latents, target_mask, shape_list = self.prepare_latents( |
There was a problem hiding this comment.
| padded_latents, target_mask, shape_list = self.prepare_latents( | |
| latents, target_mask, shape_list = self.prepare_latents( |
nit: I think it would be more clear if we use latents throughout.
| guidance_scale: float = 4.0, | ||
| negative_prompt: str | list[str] | None = None, | ||
| generator: torch.Generator | list[torch.Generator] | None = None, | ||
| latents: torch.Tensor | None = None, |
There was a problem hiding this comment.
Would it be possible to support pre-computed latents (e.g. in prepare_latents)? Most of the other pipelines support this.
| # Zero out padding text tokens to prevent them from corrupting attention | ||
| # (original uses explicit attention masking; here we neutralize padding values) | ||
| if prompt_embeds_mask is not None: | ||
| prompt_embeds = prompt_embeds * prompt_embeds_mask.unsqueeze(-1) |
There was a problem hiding this comment.
Why can't we use an attention mask here?
| # img_tensor: [C, T, H, W] -> [C, H, W] (T=1) | ||
| img_tensor = img_tensor[:, 0] | ||
| if output_type == "pil": | ||
| img_np = (img_tensor.permute(1, 2, 0).cpu().float().numpy() * 255).clip(0, 255).astype(np.uint8) |
There was a problem hiding this comment.
Would it be possible to use JoyImageEditImageProcessor.postprocess to postprocess the decoded VAE outputs? postprocess should be able to handle the output_type logic as well.
| if not return_dict: | ||
| return image |
There was a problem hiding this comment.
| if not return_dict: | |
| return image | |
| if not return_dict: | |
| return (image,) |
nit: if return_dict=False, we usually return a tuple, even if it only has one element. See for example Flux 2:
diffusers/src/diffusers/pipelines/flux2/pipeline_flux2.py
Lines 1031 to 1032 in 72eb60c
Description
We are the JoyAI Team, and this is the Diffusers implementation for the JoyAI-Image-Edit-Plus model.
GitHub Repository: [https://github.com/jd-opensource/JoyAI-Image]
Hugging Face Model: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers]
Original opensource weights: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus]
Fixes #14049
Model Overview
JoyAI-Image-Edit-Plus extends JoyAI-Image-Edit with multi-image editing capabilities. While JoyAI-Image-Edit operates on a single reference image, Edit-Plus accepts multiple reference
images as input and performs instruction-guided editing across them — enabling tasks such as subject composition, style transfer from multiple sources, and multi-view consistent editing.
It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT), supporting variable-resolution reference images that are independently
encoded and jointly denoised.
Key Features
and dog images).